Analysis of NYC Boroughs to Buy a House

The objective of this project is to study different Bouroughs of NYC to find out optimal location to buy a property.

Importing all the required libraries:

In [128]:
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", palette="pastel", color_codes=True)

# for geospatial data visualization
import folium
from folium.plugins import FastMarkerCluster

import datetime

# for web scraping and dealing with json data
import requests
import json
from pandas.io.json import json_normalize

# to get longitudes and Latitudes of addresses 
from geopy.geocoders import Nominatim

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')
Libraries imported.

Importing DOB Permit Issuance dataset:

In [129]:
df_Permit = pd.read_csv("C:/Users/archd/OneDrive/Desktop/Projects/DOB_Permit_Issuance.csv")
print("Data imported.")
C:\Users\archd\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3020: DtypeWarning: Columns (1,8,9,10,15,25,31,33,34,35,36,51,52) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Data imported.
In [130]:
df_Permit.head()
Out[130]:
BOROUGH Bin # House # Street Name Job # Job doc. # Job Type Self_Cert Block Lot Community Board Zip Code Bldg Type Residential Special District 1 Special District 2 Work Type Permit Status Filing Status Permit Type Permit Sequence # Permit Subtype Oil Gas Site Fill Filing Date Issuance Date Expiration Date Job Start Date Permittee's First Name Permittee's Last Name Permittee's Business Name Permittee's Phone # Permittee's License Type Permittee's License # Act as Superintendent Permittee's Other Title HIC License Site Safety Mgr's First Name Site Safety Mgr's Last Name Site Safety Mgr Business Name Superintendent First & Last Name Superintendent Business Name Owner's Business Type Non-Profit Owner's Business Name Owner's First Name Owner's Last Name Owner's House # Owner's House Street Name Owner’s House City Owner’s House State Owner’s House Zip Code Owner's Phone # DOBRunDate PERMIT_SI_NO LATITUDE LONGITUDE COUNCIL_DISTRICT CENSUS_TRACT NTA_NAME
0 BRONX 2102476 200 E 135TH STREET 240249842 1 A2 Y 2319 120 201 10451.0 2.0 NaN NaN NaN PL ISSUED INITIAL PL 1 NaN NaN NOT APPLICABLE 03/26/2019 12:00:00 AM 03/26/2019 12:00:00 AM 03/25/2020 12:00:00 AM 03/26/2019 12:00:00 AM ALAN CHUMSKY ECONOMY P & H CO.,INC 7183928867 MP 1548 NaN NaN NaN NaN NaN NaN NaN NaN PARTNERSHIP N CUBESMART LP MARK SHORTLIDGE 5 OLD LANCATER ROAD MALVERN PA 19355 6.10401e+09 03/27/2019 12:00:00 AM 3608996 40.810699 -73.931313 8.0 51.0 Mott Haven-Port Morris
1 MANHATTAN 1090833 249 WEST 14 STREET 140818435 1 A3 Y 764 10 104 10011.0 2.0 NaN NaN NaN EQ ISSUED INITIAL EQ 1 OT NaN NaN 03/26/2019 12:00:00 AM 03/26/2019 12:00:00 AM 03/25/2020 12:00:00 AM 03/26/2019 12:00:00 AM WILLIAM LAFFEY SPRING SCAFFOLDING LLC 7183924921 GC 607447 NaN NaN NaN NaN NaN NaN NaN NaN PARTNERSHIP N SPRING SCAFFOLDING LLC WILLIAM LAFFEY 49-30 31ST PLACE LIC NY 11101 7.18392e+09 03/27/2019 12:00:00 AM 3609359 40.739151 -74.001075 3.0 81.0 Hudson Yards-Chelsea-Flatiron-Union Square
2 MANHATTAN 1015903 20 WEST 36TH STREET 123372466 2 A2 Y 837 58 105 10018.0 2.0 NaN NaN NaN MH ISSUED INITIAL EW 1 MH NaN NaN 03/26/2019 12:00:00 AM 03/26/2019 12:00:00 AM 07/18/2019 12:00:00 AM 03/26/2019 12:00:00 AM DANIEL HULBERT MILLSTONE CONSTRUCTION AS 9144943189 GC 619626 NaN NaN NaN NaN NaN NaN NaN NaN CORPORATION N STATECOURT EN, C/O KOEPPEL ROSEN DAVID KOEPPEL 40 EAST 69TH STREET NEW YORK NY 10021 2.12249e+09 03/27/2019 12:00:00 AM 3609159 40.750090 -73.984513 4.0 84.0 Midtown-Midtown South
3 QUEENS 4467709 14-30 BROADWAY 421105618 1 A2 Y 529 2 401 11106.0 2.0 NaN NaN NaN MH ISSUED RENEWAL EW 2 MH NaN NOT APPLICABLE 03/26/2019 12:00:00 AM 03/26/2019 12:00:00 AM 09/15/2019 12:00:00 AM 07/20/2018 12:00:00 AM AHMAD REYAZ ACS SYSTEM ASSOCIATES INC 9146655800 GC 14698 NaN NaN NaN NaN NaN NaN NaN NaN NYC AGENCY N NYC SCA EFTIHIA TSITIRIDIS 30-30 THOMSON AVENUE LONG ISLAND CIT NY 11101 7.18473e+09 03/27/2019 12:00:00 AM 3609358 40.765826 -73.932800 22.0 45.0 Astoria
4 QUEENS 4048873 42-06 108 STREET 421024689 1 A1 N 1986 37 404 11368.0 2.0 YES NaN NaN NaN ISSUED RENEWAL AL 5 NaN NaN NOT APPLICABLE 03/21/2019 12:00:00 AM 03/26/2019 12:00:00 AM 03/25/2020 12:00:00 AM 02/21/2017 12:00:00 AM WILLIAM WONG SMVP MGMT INC 6318614718 GC 618937 NaN NaN NaN NaN NaN NaN JACKSON TAM NaN INDIVIDUAL N NaN CALOGERO MULE 42-06 108 STREET QUEENS NY 11368 7.18279e+09 03/27/2019 12:00:00 AM 3606599 40.749221 -73.858003 21.0 403.0 North Corona

Checking null values in the dataset:

In [131]:
df_Permit.isnull().sum()
Out[131]:
BOROUGH                                   0
Bin #                                     0
House #                                   4
Street Name                               4
Job #                                     0
Job doc. #                                0
Job Type                                  0
Self_Cert                           1274022
Block                                   499
Lot                                     508
Community Board                        4757
Zip Code                               2221
Bldg Type                             54477
Residential                         2139591
Special District 1                  3121182
Special District 2                  3439516
Work Type                            609717
Permit Status                         10813
Filing Status                             0
Permit Type                               1
Permit Sequence #                         0
Permit Subtype                      1393411
Oil Gas                             3470104
Site Fill                            417333
Filing Date                               1
Issuance Date                         19972
Expiration Date                       11143
Job Start Date                           30
Permittee's First Name                15732
Permittee's Last Name                 15748
Permittee's Business Name             48868
Permittee's Phone #                   15963
Permittee's License Type             269081
Permittee's License #                238702
Act as Superintendent               1912116
Permittee's Other Title             3236862
HIC License                         3477843
Site Safety Mgr's First Name        3481885
Site Safety Mgr's Last Name         3481861
Site Safety Mgr Business Name       3490529
Superintendent First & Last Name    1814931
Superintendent Business Name        1847714
Owner's Business Type                164588
Non-Profit                           160499
Owner's Business Name                715709
Owner's First Name                     1858
Owner's Last Name                      1553
Owner's House #                        1125
Owner's House Street Name              1369
Owner’s House City                      706
Owner’s House State                     679
Owner’s House Zip Code                 6175
Owner's Phone #                       49040
DOBRunDate                                0
PERMIT_SI_NO                              0
LATITUDE                              12258
LONGITUDE                             12258
COUNCIL_DISTRICT                      12258
CENSUS_TRACT                          12258
NTA_NAME                              12258
dtype: int64

Checking number of observations and variables in the dataset.

In [132]:
df_Permit.shape
Out[132]:
(3508249, 60)

The DOB Permit Issuance dataset has 3508249 observations and 60 variables. For this analysis we do not require all 60 variables and after studying all the variables, I have decided to keep only 14 variables which are described below.

  • Bin # : Building Identification Number assigned by Department of City Planning.
  • BOROUGH : The name of the NYC borough where the proposed work will take place.
  • Job Type : 2-digit code to indicate the overall job type for the application. (A1 : Major Alteration, A2 : Cosiderable Alteration, A3 : Minor Alteration, NB : New Building, DM : Demolition, SG : Sign)
  • Residential : If the building will be used for residential purpose or not.
  • Permit Status : The current status of the permit application.
  • Filing Status : Indicates if this is the first time the permit is being applied for or if the permit is being renewed.
  • Oil Gas : If the permit is for work on fuel burning equipment, this indicates whether it burns oil or gas.
  • Site Fill : This indicates the source of any fill dirt that will be used on the construction site.
  • Issuance Date : The date the permit was issued.
  • Expiration Date : The date that the permit expires.
  • Non-Profit : Indicates if the building is owned by a non-profit.
  • Owner's Business Type : Indicates the type of entity that owns the building where the work will be performed.
  • LATITUDE : Latitude for the building where the proposed work will take place.
  • LONGITUDE : Longitude for the building where the proposed work will take place.
In [133]:
cols = ["Bin #","BOROUGH","Job Type", "Residential","Permit Status","Filing Status","Oil Gas","Site Fill",
            "Issuance Date", "Expiration Date", "Non-Profit","Owner's Business Type","LATITUDE","LONGITUDE"]
In [134]:
df_Permit = df_Permit[cols]
df_Permit.shape
Out[134]:
(3508249, 14)
In [135]:
df_Permit.isnull().sum()
Out[135]:
Bin #                          0
BOROUGH                        0
Job Type                       0
Residential              2139591
Permit Status              10813
Filing Status                  0
Oil Gas                  3470104
Site Fill                 417333
Issuance Date              19972
Expiration Date            11143
Non-Profit                160499
Owner's Business Type     164588
LATITUDE                   12258
LONGITUDE                  12258
dtype: int64

EDA

Latitude and Longitude of the building

In [136]:
df_Permit[["LONGITUDE","LATITUDE"]].head()
Out[136]:
LONGITUDE LATITUDE
0 -73.931313 40.810699
1 -74.001075 40.739151
2 -73.984513 40.750090
3 -73.932800 40.765826
4 -73.858003 40.749221
In [137]:
df_Permit[["LONGITUDE","LATITUDE"]].isnull().sum()
Out[137]:
LONGITUDE    12258
LATITUDE     12258
dtype: int64

We will have to drop the observations which do not have longitude and latitude values as we cannnot know exact locations of buildings without them. We can fill the values of Longitudes and Latitudes depends on the address but it should be perfect. Also, finding longitudes and latitudes of 12000 observations will require time as well as computation power.

In [138]:
df_Permit = df_Permit.dropna(subset = ["LONGITUDE","LATITUDE"])
In [139]:
df_Permit[["LONGITUDE","LATITUDE"]].isnull().sum()
Out[139]:
LONGITUDE    0
LATITUDE     0
dtype: int64

Permit Issuance Date, Expiration Date and Permit Status

In [140]:
df_Permit[["Issuance Date", "Expiration Date"]].head(10)
Out[140]:
Issuance Date Expiration Date
0 03/26/2019 12:00:00 AM 03/25/2020 12:00:00 AM
1 03/26/2019 12:00:00 AM 03/25/2020 12:00:00 AM
2 03/26/2019 12:00:00 AM 07/18/2019 12:00:00 AM
3 03/26/2019 12:00:00 AM 09/15/2019 12:00:00 AM
4 03/26/2019 12:00:00 AM 03/25/2020 12:00:00 AM
5 03/26/2019 12:00:00 AM 03/25/2020 12:00:00 AM
6 03/26/2019 12:00:00 AM 02/16/2020 12:00:00 AM
7 03/26/2019 12:00:00 AM 03/25/2020 12:00:00 AM
8 03/26/2019 12:00:00 AM 04/01/2020 12:00:00 AM
9 03/26/2019 12:00:00 AM 03/15/2020 12:00:00 AM
In [141]:
df_Permit[["Issuance Date", "Expiration Date"]].isnull().sum()
Out[141]:
Issuance Date      19928
Expiration Date    11123
dtype: int64
In [142]:
df_Permit[["Issuance Date", "Expiration Date"]][df_Permit["Issuance Date"].isnull()].head(10)
Out[142]:
Issuance Date Expiration Date
3488277 NaN NaN
3488278 NaN NaN
3488279 NaN 12/12/2001 12:00:00 AM
3488280 NaN NaN
3488281 NaN NaN
3488282 NaN 12/31/2001 12:00:00 AM
3488283 NaN NaN
3488284 NaN NaN
3488285 NaN NaN
3488286 NaN 11/27/2002 12:00:00 AM
In [143]:
df_Permit[["Issuance Date", "Expiration Date"]][df_Permit["Expiration Date"].isnull()].head(10)
Out[143]:
Issuance Date Expiration Date
1578009 09/02/2009 12:00:00 AM NaN
1609300 06/15/2009 12:00:00 AM NaN
1628844 04/24/2009 12:00:00 AM NaN
3123897 11/07/1996 12:00:00 AM NaN
3488277 NaN NaN
3488278 NaN NaN
3488280 NaN NaN
3488281 NaN NaN
3488283 NaN NaN
3488284 NaN NaN

All of these dates are in string format. I will have to convert them into Datetime format.

In [144]:
df_Permit["Issuance Date"] = pd.to_datetime(df_Permit["Issuance Date"])
In [145]:
df_Permit["Issuance Date"].head()
Out[145]:
0   2019-03-26
1   2019-03-26
2   2019-03-26
3   2019-03-26
4   2019-03-26
Name: Issuance Date, dtype: datetime64[ns]
In [146]:
df_Permit["Expiration Date"] = pd.to_datetime(df_Permit["Expiration Date"])
In [147]:
df_Permit["Expiration Date"].head()
Out[147]:
0   2020-03-25
1   2020-03-25
2   2019-07-18
3   2019-09-15
4   2020-03-25
Name: Expiration Date, dtype: datetime64[ns]

I am interested into those buildings whose permits expire in the future. This will tell me which buildings are under construction right now. For this I will consider only those buildings whose permit expiration date is later than today's date.

In [298]:
present = str(datetime.datetime.now().date())
df_Permit = df_Permit[(df_Permit["Expiration Date"] > present) | (df_Permit["Expiration Date"].isnull())] 
In [300]:
df_Permit["Permit Status"].value_counts()
Out[300]:
ISSUED        121622
IN PROCESS     11840
RE-ISSUED       1094
Name: Permit Status, dtype: int64
In [301]:
df_Permit[["Permit Status","Expiration Date"]][df_Permit["Permit Status"] == "IN PROCESS"].head(20)
Out[301]:
Permit Status Expiration Date
38231 IN PROCESS 2019-08-20
46365 IN PROCESS 2019-12-01
92760 IN PROCESS 2019-08-29
105266 IN PROCESS 2019-06-15
3488277 IN PROCESS NaT
3488278 IN PROCESS NaT
3488280 IN PROCESS NaT
3488281 IN PROCESS NaT
3488283 IN PROCESS NaT
3488284 IN PROCESS NaT
3488285 IN PROCESS NaT
3488287 IN PROCESS NaT
3488289 IN PROCESS NaT
3488290 IN PROCESS NaT
3488291 IN PROCESS NaT
3488294 IN PROCESS NaT
3488295 IN PROCESS NaT
3488296 IN PROCESS NaT
3488297 IN PROCESS NaT
3488299 IN PROCESS NaT

As we can see for buildings whose permit applications are in process, do not have permit expiration date. This is right and we will have to keep these observations in our analysis.

In [302]:
df_Permit = df_Permit[(df_Permit["Expiration Date"].notnull()) | ((df_Permit["Permit Status"] == "IN PROCESS") & (df_Permit["Expiration Date"].isnull()))]
In [303]:
df_Permit.shape
Out[303]:
(134581, 14)
In [304]:
df_Permit[["Permit Status","Expiration Date"]][df_Permit["Permit Status"].isnull()].head()
Out[304]:
Permit Status Expiration Date
5758 NaN 2019-07-06
32201 NaN 2019-06-02
34923 NaN 2019-08-10
46392 NaN 2019-06-14
58146 NaN 2019-11-09

As we can see for all null values of Permit Status, Expiration date exsist. This tells us that Permit Status is issued. We can fill the null values of Permit Status as ISSUED. Also, I am renaming RE-ISSUED as ISSUED because I want to findout which buldings will be under construction in the future.

In [305]:
df_Permit["Permit Status"] = df_Permit["Permit Status"].fillna("ISSUED")
df_Permit["Permit Status"] = df_Permit["Permit Status"].replace({"RE-ISSUED":"ISSUED"})
In [306]:
df_Permit["Permit Status"].value_counts()
Out[306]:
ISSUED        122741
IN PROCESS     11840
Name: Permit Status, dtype: int64
In [307]:
df_Permit["Permit Status"].isnull().sum()
Out[307]:
0
In [308]:
sns.countplot(df_Permit["Permit Status"])
Out[308]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce07d9b2b0>

Let's Visualize for which buildings Applications are under process as they will be under construction in near future.

In [309]:
temp_df = df_Permit[df_Permit["Permit Status"]=="IN PROCESS"]
In [310]:
temp_df.shape
Out[310]:
(11840, 14)
In [311]:
boroughs = df_Permit["BOROUGH"].unique()
In [312]:
temp_df2 = df_Permit[["BOROUGH","LONGITUDE","LATITUDE"]]
borough_temp_df = pd.DataFrame(columns = ["BOROUGH","LONGITUDE","LATITUDE"])
borough_temp_df
Out[312]:
BOROUGH LONGITUDE LATITUDE
In [313]:
for i in range(len(boroughs)):
    borough_temp_df = borough_temp_df.append({'BOROUGH': boroughs[i], 'LONGITUDE': temp_df2["LONGITUDE"][temp_df2["BOROUGH"]==boroughs[i]].mean(), 'LATITUDE': temp_df2["LATITUDE"][temp_df2["BOROUGH"]==boroughs[i]].mean()}, ignore_index=True)
In [314]:
borough_temp_df
Out[314]:
BOROUGH LONGITUDE LATITUDE
0 BRONX -73.884201 40.847302
1 MANHATTAN -73.979645 40.758475
2 QUEENS -73.838414 40.719263
3 BROOKLYN -73.961086 40.661829
4 STATEN ISLAND -74.135433 40.586917
In [315]:
address = 'New York City, NY'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))
C:\Users\archd\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: DeprecationWarning: Using Nominatim with the default "geopy/1.19.0" `user_agent` is strongly discouraged, as it violates Nominatim's ToS https://operations.osmfoundation.org/policies/nominatim/ and may possibly cause 403 and 429 HTTP errors. Please specify a custom `user_agent` with `Nominatim(user_agent="my-application")` or by overriding the default `user_agent`: `geopy.geocoders.options.default_user_agent = "my-application"`. In geopy 2.0 this will become an exception.
  This is separate from the ipykernel package so we can avoid doing imports until
The geograpical coordinate of New York City are 40.7308619, -73.9871558.
In [316]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

map_newyork.add_child(FastMarkerCluster(temp_df[["LATITUDE","LONGITUDE"]].values.tolist()))

# add markers to map
for lat, lng, borough in zip(borough_temp_df['LATITUDE'], borough_temp_df['LONGITUDE'], borough_temp_df['BOROUGH']):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=50,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_newyork)  
    
map_newyork
Out[316]:
In [317]:
plt.figure(figsize = (8,5))
sns.countplot(x = df_Permit["BOROUGH"], hue = df_Permit["Permit Status"])
plt.show()

We can find the length of time for which the permit is issued from the Permit Issuance Date and Permit Expiration Date.

In [318]:
df_Permit["Permit Length"] = (df_Permit["Expiration Date"] - df_Permit["Issuance Date"]).astype('timedelta64[D]').values
In [319]:
df_Permit["Permit Length"].head()
Out[319]:
0    365.0
1    365.0
2    114.0
3    173.0
4    365.0
Name: Permit Length, dtype: float64
In [320]:
df_Permit["Permit Length"][df_Permit["Permit Length"].notnull()].describe()
Out[320]:
count    122744.000000
mean        330.916216
std         604.072007
min           8.000000
25%         317.000000
50%         365.000000
75%         365.000000
max       73551.000000
Name: Permit Length, dtype: float64
In [321]:
m = df_Permit["Permit Length"][df_Permit["Permit Length"].notnull()].mean()
m
Out[321]:
330.9162158639119

We have null values in the variable Permit Length as either Issuance or Expiration Date is null. I am filling this value with the mean value.

In [322]:
df_Permit["Permit Length"] = df_Permit["Permit Length"].fillna(m)

Now we do not need Permit Issuance Date and Expiration Date. I am dropping them.

In [323]:
df_Permit.drop(["Issuance Date", "Expiration Date"], axis =1, inplace = True)

Borough

In [324]:
df_Permit["BOROUGH"].value_counts()
Out[324]:
MANHATTAN        54915
BROOKLYN         36572
QUEENS           27155
BRONX            10474
STATEN ISLAND     5465
Name: BOROUGH, dtype: int64
In [325]:
plt.figure(figsize = (8,5))
sns.countplot(df_Permit["BOROUGH"],order = df_Permit['BOROUGH'].value_counts().index,palette=("Blues_d"))
plt.show()

As we can see Manhattan is rapidly growing as so most of the DOB permits are issued for buildings of this borough.

In [326]:
temp_df = df_Permit[df_Permit["BOROUGH"]=="MANHATTAN"]
In [327]:
sns.countplot(x = temp_df["Job Type"],order = temp_df["Job Type"].value_counts().index)
plt.title("Type of Job in Manhattan")
Out[327]:
Text(0.5, 1.0, 'Type of Job in Manhattan')

In Manhattan most of the permits are issued for considerable alterations in buildings followed by minor and most alterations. The permits issued for new buildings are very less which tells us that this borough is developed from the long time and buying a property here will be really costly. Since many alterations are going on, there will be a considerable noise pollution as well in this borough.

Job Type

In [328]:
df_Permit["Job Type"].value_counts()
Out[328]:
A2    92833
NB    14741
A1    13928
A3    10653
DM     1752
SG      674
Name: Job Type, dtype: int64
In [329]:
sns.countplot(x = df_Permit["Job Type"],order = df_Permit["Job Type"].value_counts().index)
plt.title("Type of Job in NYC")
Out[329]:
Text(0.5, 1.0, 'Type of Job in NYC')

As we can see most of the permits are issued for considerable alterations followed by new buidlings, major alterations, minor alterations, Demolition and Sign changes.

Let's compare different boroughs and job types
In [330]:
plt.figure(figsize = (12, 8))
sns.countplot(x = df_Permit["Job Type"], hue = df_Permit["BOROUGH"])
plt.show()

Above graph tells us that Queens and Brooklyn are developing boroughs as many new buildings are being built there. Let's see them in map.

In [331]:
temp_df = df_Permit[df_Permit["Job Type"]=="NB"]
In [332]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

map_newyork.add_child(FastMarkerCluster(temp_df[["LATITUDE","LONGITUDE"]].values.tolist()))

# add markers as boroughs to map
for lat, lng, borough in zip(borough_temp_df['LATITUDE'], borough_temp_df['LONGITUDE'], borough_temp_df['BOROUGH']):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=50,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_newyork)  
    
map_newyork
Out[332]:

Residential

In [333]:
df_Permit["Residential"].value_counts()
Out[333]:
YES    73172
Name: Residential, dtype: int64

We can fill null values of Residential as "NO" as they might not be used for residential purpose.

In [334]:
df_Permit["Residential"] = df_Permit["Residential"].fillna("NO")
In [335]:
df_Permit["Residential"].value_counts()
Out[335]:
YES    73172
NO     61409
Name: Residential, dtype: int64
In [336]:
sns.countplot(x = df_Permit["Residential"])
Out[336]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cdccda0908>
In [337]:
plt.figure(figsize = (8,5))
sns.countplot(x = df_Permit["BOROUGH"], hue = df_Permit["Residential"])
plt.show()

Brooklyn and Queens have most of the properties for residential purpose. Also, many new buildings are being built there. Those boroughs can be very good choices for buying houses/apartments.

In [338]:
temp_df = df_Permit[df_Permit["Residential"]=="YES"]
In [339]:
temp_df.shape
Out[339]:
(73172, 13)
In [340]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

map_newyork.add_child(FastMarkerCluster(temp_df[["LATITUDE","LONGITUDE"]][1:55000].values.tolist()))

# add markers as boroughs to map
for lat, lng, borough in zip(borough_temp_df['LATITUDE'], borough_temp_df['LONGITUDE'], borough_temp_df['BOROUGH']):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=50,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_newyork)  
    
map_newyork #showing 55000 values because of the limitation of the folium library
Out[340]:
In [341]:
df_Permit["Residential"] = df_Permit["Residential"].replace({"NO":0, "YES":1})

Oil Gas

In [342]:
df_Permit["Oil Gas"].value_counts()
Out[342]:
OIL    488
GAS     61
Name: Oil Gas, dtype: int64
In [343]:
df_Permit["Oil Gas"].isnull().sum()
Out[343]:
134032
In [344]:
df_Permit["Oil Gas"] = df_Permit["Oil Gas"].fillna("NONE")
In [345]:
sns.countplot(df_Permit["Oil Gas"])
Out[345]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce0d383e10>
In [346]:
sns.countplot(hue = df_Permit["Oil Gas"], x = df_Permit["Job Type"])
Out[346]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce0a4b05f8>

Most of the buildings do not require OIL/GAS and this variable might not be helpful for my analysis.I am dropping this variable.

In [347]:
df_Permit.drop("Oil Gas", axis =1, inplace = True)

Owner's Business Type

In [348]:
df_Permit["Owner\'s Business Type"].value_counts()
Out[348]:
CORPORATION           48006
INDIVIDUAL            38709
PARTNERSHIP           30605
CONDO/CO-OP            8246
OTHER GOV'T AGENCY     4269
NYCHA/HHC              1939
NYC AGENCY             1563
OTHER                   357
NYCHA                   316
HPD                      73
DCAS                     18
HHC                      16
DOE                      14
NY STATE                 13
Name: Owner's Business Type, dtype: int64
In [349]:
sns.countplot(df_Permit["Owner\'s Business Type"])
plt.xticks(rotation = 90)
Out[349]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13]),
 <a list of 14 Text xticklabel objects>)

For all Business types who have less few building permits, I will consider them under Other type for simplicity.

In [350]:
df_Permit["Owner\'s Business Type"] = df_Permit["Owner\'s Business Type"].replace({"NYCHA":"OTHER", "HPD":"OTHER",
                                                                                   "DCAS":"OTHER","NY STATE":"OTHER",
                                                                                   "DOE":"OTHER","HHC":"OTHER",
                                                                                   "NYC AGENCY":"OTHER","NYCHA/HHC":"OTHER",
                                                                                   "OTHER GOV'T AGENCY": "OTHER" })
In [351]:
df_Permit["Owner\'s Business Type"] = df_Permit["Owner\'s Business Type"].fillna("OTHER")
In [352]:
df_Permit["Owner\'s Business Type"].value_counts()
Out[352]:
CORPORATION    48006
INDIVIDUAL     38709
PARTNERSHIP    30605
OTHER           9015
CONDO/CO-OP     8246
Name: Owner's Business Type, dtype: int64
In [353]:
sns.countplot(df_Permit["Owner\'s Business Type"],order = df_Permit["Owner\'s Business Type"].value_counts().index)
plt.xticks(rotation = 90)
Out[353]:
(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)

Let's visualize which businesses are building new properties.

In [354]:
temp_df = df_Permit[df_Permit["Job Type"]=="NB"]
In [355]:
temp_df.shape
Out[355]:
(14741, 12)
In [356]:
plt.figure(figsize=(8,5))
sns.countplot(temp_df["Owner\'s Business Type"], order = temp_df["Owner\'s Business Type"].value_counts().index)
plt.title("New Buildings under construction by Different types of businesses")
plt.show()
In [357]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, businessType in zip(temp_df['LATITUDE'][1:1000], temp_df['LONGITUDE'][1:1000], temp_df['Owner\'s Business Type'][1:1000]):
    label = '{}'.format(businessType)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_newyork)  
    
map_newyork
Out[357]:

Most of the new buildings are being built by Corporation followed by Individual and Partnership.

Non-Profit

In [358]:
df_Permit["Non-Profit"].value_counts()
Out[358]:
N    127193
Y      6985
Name: Non-Profit, dtype: int64
In [359]:
df_Permit["Non-Profit"].isnull().sum()
Out[359]:
403
In [360]:
df_Permit["Non-Profit"] = df_Permit["Non-Profit"].fillna("N")
In [361]:
sns.countplot(df_Permit["Non-Profit"])
Out[361]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce0bcb3b38>
In [362]:
plt.figure(figsize = (8,5))
sns.countplot(x = df_Permit["BOROUGH"], hue = df_Permit["Non-Profit"])
plt.show()
In [363]:
df_Permit["Non-Profit"] = df_Permit["Non-Profit"].replace({"N":0, "Y":1})
In [364]:
temp_df = df_Permit[df_Permit["Non-Profit"]== 1]
In [365]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough in zip(temp_df['LATITUDE'], temp_df['LONGITUDE'], temp_df['BOROUGH'][1:1000]):
    label = '{}'.format(borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_newyork)  
    
map_newyork
Out[365]:

The above map shows the non-profit buildings for which permits are issued. Manhattan has maximum density of non-profit organizations.

Site Fill

In [366]:
df_Permit["Site Fill"].value_counts()
Out[366]:
NOT APPLICABLE         84180
ON-SITE                 8475
USE UNDER 300 CU.YD     8286
NONE                    4957
OFF-SITE                1608
Name: Site Fill, dtype: int64
In [367]:
df_Permit["Site Fill"] = df_Permit["Site Fill"].fillna("NONE")
In [368]:
df_Permit["Site Fill"] = df_Permit["Site Fill"].replace({"NOT APPLICABLE":0, "NONE":0, "ON-SITE":1,"OFF-SITE":1, "USE UNDER 300 CU.YD":1 })
In [369]:
df_Permit["Site Fill"].value_counts()
Out[369]:
0    116212
1     18369
Name: Site Fill, dtype: int64
In [370]:
sns.countplot(df_Permit["Site Fill"])
Out[370]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce07a382b0>
In [371]:
plt.figure(figsize = (8,5))
sns.countplot(x = df_Permit["BOROUGH"], hue = df_Permit["Site Fill"])
plt.show()

The value of Site fill is True where mostly new buildings are being built.

Filing Status

In [372]:
df_Permit["Filing Status"].value_counts()
Out[372]:
INITIAL    77465
RENEWAL    57116
Name: Filing Status, dtype: int64
In [373]:
plt.figure(figsize = (8,5))
sns.countplot(x = df_Permit["BOROUGH"],hue = df_Permit["Filing Status"])
plt.show()
In [374]:
df_Permit["Filing Status"] = df_Permit["Filing Status"].replace({"INITIAL":1, "RENEWAL":2})
In [375]:
df_Permit.isnull().sum()
Out[375]:
Bin #                    0
BOROUGH                  0
Job Type                 0
Residential              0
Permit Status            0
Filing Status            0
Site Fill                0
Non-Profit               0
Owner's Business Type    0
LATITUDE                 0
LONGITUDE                0
Permit Length            0
dtype: int64

K-Means Clustering

I do not require Building Identification Number and I am dropping it.

In [376]:
df_Permit.drop("Bin #", axis =1, inplace = True)
In [377]:
df_Permit.dtypes
Out[377]:
BOROUGH                   object
Job Type                  object
Residential                int64
Permit Status             object
Filing Status              int64
Site Fill                  int64
Non-Profit                 int64
Owner's Business Type     object
LATITUDE                 float64
LONGITUDE                float64
Permit Length            float64
dtype: object

We will have to remove Borough and Longitude, Latitude to make clusters.

In [378]:
df = df_Permit.drop(["BOROUGH","LATITUDE","LONGITUDE"], axis = 1)
In [379]:
catcols = df.columns[df.dtypes == object]
df2 = pd.get_dummies(df, columns=catcols)
In [380]:
df2.head()
Out[380]:
Residential Filing Status Site Fill Non-Profit Permit Length Job Type_A1 Job Type_A2 Job Type_A3 Job Type_DM Job Type_NB Job Type_SG Permit Status_IN PROCESS Permit Status_ISSUED Owner's Business Type_CONDO/CO-OP Owner's Business Type_CORPORATION Owner's Business Type_INDIVIDUAL Owner's Business Type_OTHER Owner's Business Type_PARTNERSHIP
0 0 1 0 0 365.0 0 1 0 0 0 0 0 1 0 0 0 0 1
1 0 1 0 0 365.0 0 0 1 0 0 0 0 1 0 0 0 0 1
2 0 1 0 0 114.0 0 1 0 0 0 0 0 1 0 1 0 0 0
3 0 2 0 0 173.0 0 1 0 0 0 0 0 1 0 0 0 1 0
4 1 2 0 0 365.0 1 0 0 0 0 0 0 1 0 0 1 0 0
In [381]:
df2.shape
Out[381]:
(134581, 18)
In [382]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df2)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
Out[382]:
array([2, 2, 0, 0, 2, 2, 2, 2, 2, 2])
In [383]:
df_Permit_merged = df_Permit.copy()

# add clustering labels
df_Permit_merged['Cluster Labels'] = kmeans.labels_

df_Permit_merged.head()
Out[383]:
BOROUGH Job Type Residential Permit Status Filing Status Site Fill Non-Profit Owner's Business Type LATITUDE LONGITUDE Permit Length Cluster Labels
0 BRONX A2 0 ISSUED 1 0 0 PARTNERSHIP 40.810699 -73.931313 365.0 2
1 MANHATTAN A3 0 ISSUED 1 0 0 PARTNERSHIP 40.739151 -74.001075 365.0 2
2 MANHATTAN A2 0 ISSUED 1 0 0 CORPORATION 40.750090 -73.984513 114.0 0
3 QUEENS A2 0 ISSUED 2 0 0 OTHER 40.765826 -73.932800 173.0 0
4 QUEENS A1 1 ISSUED 2 0 0 INDIVIDUAL 40.749221 -73.858003 365.0 2

Let's visualize the clusters and Borough relationship.

In [384]:
plt.figure(figsize = (10,5))
sns.countplot(x=df_Permit_merged["BOROUGH"], hue = df_Permit_merged["Cluster Labels"])
plt.show()

As we can see that, K-means clustering does not yield any good results. All these buildings do not have similar characteristics. To find similarity we need more variables to analyze.

Let's Analyze different Boroughs by most types of venues they have

In [385]:
borough_temp_df
Out[385]:
BOROUGH LONGITUDE LATITUDE
0 BRONX -73.884201 40.847302
1 MANHATTAN -73.979645 40.758475
2 QUEENS -73.838414 40.719263
3 BROOKLYN -73.961086 40.661829
4 STATEN ISLAND -74.135433 40.586917

Foursquare API and Credentials

In [386]:
CLIENT_ID = 'QUSLVNRGP23EPMS0B4EFN2TIMKOHSCGXXU1MOMBF04CXL4PH' # your Foursquare ID
CLIENT_SECRET = 'F05SSCAKGGAXINRW2JYPYN1FH155TIVYCQDSXHAAHVEY4HIQ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
Your credentails:
CLIENT_ID: QUSLVNRGP23EPMS0B4EFN2TIMKOHSCGXXU1MOMBF04CXL4PH
CLIENT_SECRET:F05SSCAKGGAXINRW2JYPYN1FH155TIVYCQDSXHAAHVEY4HIQ

Defining a Function which outputs 100 venues

In [387]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
In [388]:
def venuesFind(latitude, longitude, radius):
    
    radius = 3000
    LIMIT = 100
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        latitude, 
        longitude, 
        radius, 
        LIMIT)
    
    results = requests.get(url).json()
    
    venues = results['response']['groups'][0]['items']
    
    nearby_venues = json_normalize(venues) # flatten JSON

    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues =nearby_venues.loc[:, filtered_columns]

    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    
    map_borough = folium.Map(location=[latitude, longitude], zoom_start=13)
    
    for row in nearby_venues.itertuples():
        map_borough.add_child(folium.Marker(location=[row.lat, row.lng], popup=row.categories))  

    return nearby_venues, map_borough

BRONX

In [389]:
address = 'Bronx, NY'

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Bronx are {}, {}.'.format(latitude, longitude))
The geograpical coordinate of Bronx are 40.85048545, -73.8404035580209.
In [390]:
nearby_venues_Bronx , map_bronx = venuesFind(latitude, longitude, 3000)
In [391]:
nearby_venues_Bronx.head()
Out[391]:
name categories lat lng
0 Residence Inn by Marriott New York The Bronx a... Hotel 40.849325 -73.842534
1 LA Fitness Gym / Fitness Center 40.849739 -73.841949
2 Starbucks Coffee Shop 40.851371 -73.844087
3 Zeppieri & Sons Italian Bakery Bakery 40.847119 -73.832057
4 Franks Pizza Pizza Place 40.843550 -73.836006
In [392]:
nearby_venues_Bronx["categories"].value_counts().head(10)
Out[392]:
Pizza Place             13
Italian Restaurant       8
Coffee Shop              5
Spanish Restaurant       4
Mexican Restaurant       4
Deli / Bodega            4
Bakery                   4
Gym / Fitness Center     3
Diner                    3
American Restaurant      3
Name: categories, dtype: int64
In [393]:
map_bronx
Out[393]:

MANHATTAN

In [394]:
address = 'Manhattan, NY'

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Manhattan are {}, {}.'.format(latitude, longitude))
The geograpical coordinate of Manhattan are 40.7900869, -73.9598295.
In [395]:
nearby_venues_Manhattan , map_Manhattan = venuesFind(latitude, longitude, 3000)
In [396]:
nearby_venues_Manhattan.head()
Out[396]:
name categories lat lng
0 North Meadow Park 40.792027 -73.959853
1 East Meadow Field 40.790160 -73.955498
2 The Jewish Museum Museum 40.785276 -73.957411
3 Da Capo Café 40.787679 -73.953899
4 Jacqueline Kennedy Onassis Reservoir Reservoir 40.785153 -73.962935
In [397]:
nearby_venues_Manhattan["categories"].value_counts().head(10)
Out[397]:
Park                   8
Exhibit                6
Bakery                 4
Wine Shop              4
Bookstore              3
American Restaurant    3
Coffee Shop            3
Bar                    3
Garden                 3
Seafood Restaurant     3
Name: categories, dtype: int64
In [398]:
map_Manhattan
Out[398]:

BROOKLYN

In [399]:
address = 'Brooklyn, NY'

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Brooklyn are {}, {}.'.format(latitude, longitude))
The geograpical coordinate of Brooklyn are 40.6501038, -73.9495823.
In [400]:
nearby_venues_Brooklyn , map_Brooklyn = venuesFind(latitude, longitude, 3000)
In [401]:
nearby_venues_Brooklyn.head()
Out[401]:
name categories lat lng
0 Exquisite Delight Caribbean Restaurant 40.649091 -73.949243
1 Jus Juice Juice Bar 40.652208 -73.949723
2 Kings Theatre Theater 40.646110 -73.957175
3 Gandhi Fine Indian Cuisine Indian Restaurant 40.655168 -73.956325
4 Nostrand Health Foods Juice Bar 40.649092 -73.949243
In [402]:
nearby_venues_Brooklyn["categories"].value_counts().head(10)
Out[402]:
Caribbean Restaurant    11
Café                     5
Park                     4
Cocktail Bar             4
Bakery                   4
Coffee Shop              3
Garden                   3
Thai Restaurant          2
Grocery Store            2
Music Venue              2
Name: categories, dtype: int64
In [403]:
map_Brooklyn
Out[403]:

QUEENS

In [404]:
address = 'Queens, NY'

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Queens are {}, {}.'.format(latitude, longitude))
The geograpical coordinate of Queens are 40.6524927, -73.7914214158161.
In [405]:
nearby_venues_Queens , map_Queens = venuesFind(latitude, longitude, 3000)
In [406]:
nearby_venues_Queens.head()
Out[406]:
name categories lat lng
0 L'Occitane en Provence Cosmetics Shop 40.650343 -73.792202
1 American Airlines Flagship Lounge Airport Lounge 40.648101 -73.791583
2 Shake Shack Burger Joint 40.642284 -73.783273
3 Starbucks Coffee Shop 40.650657 -73.791711
4 TWA Flight Center Historic Site 40.643573 -73.782032
In [407]:
nearby_venues_Queens["categories"].value_counts().head(10)
Out[407]:
Airport Lounge                     9
Rental Car Location                8
Cosmetics Shop                     6
Coffee Shop                        5
Donut Shop                         4
Food Truck                         3
Southern / Soul Food Restaurant    3
Caribbean Restaurant               3
Mexican Restaurant                 3
Ice Cream Shop                     3
Name: categories, dtype: int64
In [408]:
map_Queens
Out[408]:

STATEN ISLAND

In [409]:
address = 'Staten Island, NY'

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Staten Island are {}, {}.'.format(latitude, longitude))
The geograpical coordinate of Staten Island are 40.5834557, -74.1496048.
In [410]:
nearby_venues_SI , map_SI = venuesFind(latitude, longitude, 3000)
In [411]:
nearby_venues_SI.head()
Out[411]:
name categories lat lng
0 Greenbelt Nature Center Trail 40.586616 -74.146917
1 LaTourette Park & Golf Course Golf Course 40.575603 -74.147487
2 Trader Joe's Grocery Store 40.589997 -74.165715
3 Bath & Body Works Cosmetics Shop 40.582460 -74.165753
4 Disney store Toy / Game Store 40.581963 -74.166272
In [412]:
nearby_venues_SI["categories"].value_counts().head(10)
Out[412]:
Italian Restaurant        5
Clothing Store            4
Department Store          3
Furniture / Home Store    3
Restaurant                3
Bakery                    2
Cosmetics Shop            2
Diner                     2
Campground                2
Bagel Shop                2
Name: categories, dtype: int64
In [413]:
map_SI
Out[413]:
In [ ]: